Not interpreting server NotFound error as server down #4854
Fixes #4460
How does the runner get stuck?
Here's the failure scenario:
1: The server and runner are both connected and fine.
2: A new job arrives, and the server decides to assign it to our runner.
3: Our first problem - the server has an issue assigning the job to a runner, and it's a NotFound-type error. We've seen it happen here, while getting config-sourcers.
waypoint/pkg/server/singleprocess/service_runner.go, line 661 in 918df22
4: The runner receives this error, and because it's a NotFound error, it calls goto RESTART_JOB_STREAM (here), which we should only do if we've actually disconnected from the server. This is the bug (see the sketch after this list). Relevant chunk: waypoint/internal/runner/accept.go, lines 278 to 287 in 918df22
5: The runner follows the goto, and then blocks here forever: waypoint/internal/runner/accept.go, line 224 in 918df22
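To make step 4 concrete, here's a minimal sketch of the kind of classification that causes the spurious restart. It's hypothetical code (the function name shouldRestartStream and its structure are mine, not waypoint's); the point is that a NotFound status arriving over a healthy connection gets lumped in with real transport failures:

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// shouldRestartStream sketches the buggy decision in the accept goroutine:
// it treats any NotFound status from the job stream as evidence that the
// server connection dropped, which is what triggers the
// "goto RESTART_JOB_STREAM" path described in step 4.
func shouldRestartStream(err error) bool {
	switch status.Code(err) {
	case codes.Unavailable:
		// A real transport-level failure: restarting the stream is correct.
		return true
	case codes.NotFound:
		// The bug: the server can return NotFound over a perfectly healthy
		// connection (e.g. while looking up config sourcers), but it is
		// still classified as a disconnect here.
		return true
	default:
		return false
	}
}

func main() {
	// The kind of error the server produced in step 3, delivered over a
	// healthy connection.
	err := status.Errorf(codes.NotFound, "config sourcer not found")
	fmt.Println("restart job stream?", shouldRestartStream(err)) // prints true
}
```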
Why is the runner blocking there, you ask? Here's my best understanding:
If this were a true disconnect, a second goroutine would also be attempting to reconnect - the recvConfig loop (here). It has sophisticated reconnect logic, and if it had also seen the disconnect, it would reconnect and then increment the runner-wide state version here, which would unblock the accept goroutine. @martisah and I verified this through observation.
That isn't happening here, because the recvConfig goroutine hasn't seen an error case, so it never has a reason to modify the runner state.
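For intuition on why the accept goroutine can park forever, here's a minimal, hypothetical sketch of the state-version pattern described above (the type and method names are made up for illustration, not taken from waypoint): one goroutine waits for a version counter to change, and only the reconnect path of another goroutine ever bumps it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runnerState is a hypothetical stand-in for the runner-wide state: a
// version counter guarded by a mutex, with a condition variable used to
// wake waiters when the version changes.
type runnerState struct {
	mu      sync.Mutex
	cond    *sync.Cond
	version uint64
}

func newRunnerState() *runnerState {
	s := &runnerState{}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// waitForChange blocks until the version differs from old. This is what the
// accept goroutine does after it decides (wrongly, in the bug) that the
// server disconnected: it parks here waiting for someone else to bump the
// version.
func (s *runnerState) waitForChange(old uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.version == old {
		s.cond.Wait()
	}
}

// bumpVersion is what the recvConfig goroutine's reconnect logic does after
// it re-establishes the server connection. If recvConfig never saw an error,
// this is never called, and waitForChange blocks forever.
func (s *runnerState) bumpVersion() {
	s.mu.Lock()
	s.version++
	s.mu.Unlock()
	s.cond.Broadcast()
}

func main() {
	s := newRunnerState()

	go func() {
		// Simulate recvConfig reconnecting and signalling the change.
		// In the bug, this path never runs, so the waiter stays blocked.
		time.Sleep(100 * time.Millisecond)
		s.bumpVersion()
	}()

	s.waitForChange(0)
	fmt.Println("accept goroutine unblocked")
}
```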
How does this fix it?
The primary cause is that the runner's accept goroutine is interpreting any NotFound error from the server as a disconnect event, which it expects all other goroutines will see. That's a bad assumption.
This PR modifies the accept goroutine to exit if it sees a NotFound error, rather than attempt a server reconnect. This surfaced a new bug: the top-level AcceptMany function was interpreting the server sending NotFound on the runner job stream as a deregistration event: waypoint/internal/runner/accept.go, lines 76 to 80 in 918df22
That's also a bad assumption. The server in fact sends FailedPrecondition (here).
This PR fixes that second bug by instead interpreting NotFound as a generic error. The AcceptMany NotFound check was introduced to prevent a runner DDoS event (here), and this change continues to protect against that by moving the sleep-before-reconnect to the default case.
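To illustrate the shape of the fix, here's a simplified, hypothetical sketch of the outer retry loop (the names and structure are mine, not the actual accept.go): FailedPrecondition is the only code treated as deregistration, while NotFound falls through to the default case, which logs, sleeps, and retries, preserving the original DDoS protection.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// makeAccept stands in for one pass of the runner's Accept loop. In this
// sketch it fails twice with NotFound before succeeding, to show the retry
// behaviour.
func makeAccept() func(context.Context) error {
	failures := 2
	return func(ctx context.Context) error {
		if failures > 0 {
			failures--
			return status.Errorf(codes.NotFound, "config sourcer not found")
		}
		return nil
	}
}

// acceptMany sketches the fixed outer loop: only FailedPrecondition is
// treated as "we were deregistered, stop", and any other error (including
// NotFound) is logged and retried after a short sleep.
func acceptMany(ctx context.Context, accept func(context.Context) error) error {
	for {
		err := accept(ctx)
		if err == nil {
			// Accepted a job; stop after one success just for this sketch.
			fmt.Println("accepted a job")
			return nil
		}

		switch status.Code(err) {
		case codes.FailedPrecondition:
			// The server signals deregistration with FailedPrecondition,
			// so only this case exits the loop.
			return errors.New("runner deregistered: " + err.Error())
		default:
			// NotFound (and anything else) is just an error: log it, back
			// off briefly, and try to accept again.
			fmt.Println("error accepting job, retrying:", err)
			time.Sleep(100 * time.Millisecond)
		}
	}
}

func main() {
	if err := acceptMany(context.Background(), makeAccept()); err != nil {
		fmt.Println(err)
	}
}
```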
tl;dr: with this change, if the server responds to the runner with a NotFound error, the Accept loop will log the error, exit, and then be restarted successfully by AcceptMany, ready to accept another job.
How can I reproduce this bug?
Reproduction involves running repeated waypoint up's, destroying the project, recreating the project, and repeating. The runner, if it sees the server down before accepting a job, will reconnect, and then refuse to accept additional jobs. You will notice continued logs from waypoint.runner.agent.runner.config.watcher, but none from the accept loop.
Future considerations